Bioinformatics (Thomas Dandekar, Meik Kunz)


investigate the promoter region for transcription factor binding sites (TFBS). Transcription factors (TFs) recognize and bind to specific DNA motifs in the promoter, the TFBSs, and thus regulate transcription. If I know the consensus sequence of the TFBS (template), i.e. the DNA nucleotides to which the TF binds, I can easily search an unknown sequence bioinformatically for possible binding sites, which I can then use for further experimental investigations. Appropriate software is already available for this purpose. Apart from programs that list experimentally validated TFBS (such as MotifMap), there are also numerous programs that predict TFBS, e.g. ALGGEN PROMO, PRODORIC (Prokaryotic Database of Gene Regulation), TESS (Transcription Element Search System) or Genomatix. It is useful to always use several programs to compare results and find common TFBS. Because such programs often disappear from the openly accessible internet, as they can be used and sold commercially, we recently published AIModules, which offers TFBS analysis including conserved TFBS modules in different promoter regions (Aydinli et al., 2022; https://aimodules.heinzelab.de/#/).
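
The advice to compare several programs can be sketched as a simple vote over their hit lists. This is only a minimal illustration: the tool names come from the text, but the hit lists are invented placeholders (not real program output) and the helper `common_tfbs` is hypothetical. In practice, the TF names would first have to be harmonized, since different programs may name the same factor differently.

```python
from collections import Counter

def common_tfbs(predictions, min_tools=2):
    """Return TF names reported by at least `min_tools` programs."""
    counts = Counter()
    for factors in predictions.values():
        counts.update(set(factors))  # count each TF at most once per tool
    return sorted(tf for tf, n in counts.items() if n >= min_tools)

# Hypothetical hit lists, invented for illustration only.
predictions = {
    "PROMO": ["NF-AT2", "SP1", "GATA-1"],
    "TESS": ["SP1", "NF-AT2", "AP-1"],
    "Genomatix": ["SP1", "C/EBP"],
}

print(common_tfbs(predictions))               # TFs found by two or more tools
print(common_tfbs(predictions, min_tools=3))  # TFs found by all three tools
```

Raising `min_tools` makes the list shorter but more conservative; which setting is sensible depends on how many programs are compared.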

A computer program for promoter analyses would first “learn” the TFBS. This is done using stochastic models, e.g. position-specific scoring matrices (PSSMs) or hidden Markov models (HMMs). In a further step, the program would read in a promoter sequence (read-in part) and search it for similarities with the learned consensus sequence (internal calculation part, e.g. with BLAST). The similarities found are then output as hits (output part).
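
The three parts described above (learning, internal calculation, output) can be sketched with a small PSSM. This is a minimal sketch under stated assumptions, not the method of any of the programs mentioned: the training sites and the promoter are invented, the function names `build_pssm` and `scan` are hypothetical, a uniform background of 0.25 per base and a pseudocount of 1 are assumed, and the search is a plain sliding-window matrix scan rather than BLAST.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Learn log-odds scores per position from aligned, equal-length sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        scores = {}
        for base in BASES:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def scan(sequence, pssm, threshold):
    """Slide the matrix along the sequence and collect windows above threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits

# Invented training sites and promoter (A/C/G/T only), for illustration.
training_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA", "TGAGTCA"]
promoter = "CCGTTGACTCAGGATCCTGAGTCATT"

pssm = build_pssm(training_sites)                         # learning part
for start, window, score in scan(promoter, pssm, 5.0):    # calculation part
    print(start, window, score)                           # output part
```

With this toy input, the two sites planted at the 0-based positions 4 and 17 are reported. The threshold of 5.0 is an arbitrary choice for the example; lowering it admits weaker and more doubtful matches.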

Possible challenges and sources of error are, for example, that several DNA sequences are necessary to create the template: the more binding sites the training data set contains, the more accurately the template can be trained. Statistical parameters, such as score thresholds and the significance of a hit, should also be considered. In addition, TFs often bind to DNA combinatorially at a certain distance from each other, and there are other elements that influence transcription, such as enhancers. A program should take all these factors into account to enable accurate prediction. In any case, it is advisable to validate bioinformatically predicted TFBS experimentally. Only then can I be sure that the TF actually has an effect on transcription. Otherwise, the prediction merely matches the DNA nucleotides (which is why I got a hit) but has no biological relevance, i.e. it is a false positive hit.
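
Why purely sequence-based hits can be false positives can be made tangible with a back-of-the-envelope calculation: short motifs are expected to occur by chance alone. The sketch below assumes a random sequence with independent positions and equal base frequencies (0.25 each), which real promoters only approximate; the function name `expected_chance_hits` is hypothetical, and the double counting of palindromic motifs on both strands is ignored.

```python
def expected_chance_hits(motif_length, sequence_length, both_strands=True):
    """Expected number of exact matches of one fixed motif in a random sequence.

    Assumption: independent positions, each base with probability 0.25.
    """
    windows = max(sequence_length - motif_length + 1, 0)
    per_window = 0.25 ** motif_length
    strands = 2 if both_strands else 1
    return windows * per_window * strands

# Compare short and longer motifs in a 1000 bp promoter region.
for k in (6, 8, 12):
    print(k, expected_chance_hits(k, sequence_length=1000))
```

Under these assumptions, a 6 bp motif is expected to match roughly once in every two random 1000 bp regions, and allowing mismatches raises this further. A hit alone is therefore weak evidence without experimental validation.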

Example 3.9

C, D (please also look at the previous answers).

ALGGEN PROMO should find numerous TFBS for the example sequence, including NF-AT2 [T01945].

If something did not work for you, it is best to try it like this: In ALGGEN PROMO, select the option “SearchSites” (under Step 2), copy the sequence into the search window and start the search. Please make sure that the “Maximum matrix dissimilarity rate” is set to its default of 15; this specifies the maximum allowed deviation from the actual DNA nucleotide sequence (template) of the TFBS. You can also change this parameter yourself and observe what happens. As output you will see all TFBS found, with their position and score (under Data [txt] you can also display a list of the TFBS found and the corresponding TFs).
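
What happens when the dissimilarity limit is changed can be illustrated with a strongly simplified stand-in. Note the assumptions: this is not PROMO's actual formula (PROMO works with weight matrices), the dissimilarity here is simply the percentage of mismatches against a consensus string, and the consensus, the sequence and the function names are invented for illustration.

```python
def dissimilarity(window, consensus):
    """Percentage of positions where the window differs from the consensus."""
    mismatches = sum(a != b for a, b in zip(window, consensus))
    return 100.0 * mismatches / len(consensus)

def search_sites(sequence, consensus, max_dissimilarity=15.0):
    """Report all windows whose dissimilarity does not exceed the limit."""
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        rate = dissimilarity(window, consensus)
        if rate <= max_dissimilarity:
            hits.append((start, window, round(rate, 1)))
    return hits

consensus = "TGACTCA"                      # hypothetical template
sequence = "CCGTTGACTCAGGATCCTGAGTCATT"    # invented example sequence

# A stricter limit gives fewer hits, a looser limit gives more.
for limit in (0, 15, 30):
    print(limit, search_sites(sequence, consensus, limit))
```

In this simplified measure, one mismatch in a 7 bp template corresponds to about 14.3 %, so a limit of 15 admits exactly one mismatch, while a limit of 0 only reports perfect matches.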

Example 3.10

Hidden Markov models are stochastic probability models that predict hidden system

states (e.g. exon, intron) from a sequence (observations, e.g. ATCCCTG...) using a Markov

20.3 Genomes – Molecular Maps of Living Organisms